Introduction to Open Data Science is a course held in 2017 for the second time. It promises to give participants all the skills needed to call themselves data scientists. A bold promise.
This week I did the regression homework. Having done the DataCamp exercises, there was very little new in the RStudio exercise itself. It seems a bit pointless, as you could pretty much copy-paste the code from DataCamp and be done. To make things a bit more challenging for myself I added some complexity not requested in the exercise.
I had done data wrangling before, so that was not new, but it was a good reminder. I had also done regression with R before, although the validation and prediction material was new. I also found that, rather than coding directly in the .Rmd file, it is easier to develop the code in a .R script first.
In this part the data is loaded into a data frame. Information about the dataset can be found here. In brief, it contains data on students: their attitudes, learning styles, basic demographics and exam scores.
# Read data from a local comma separated values file.
learning2014 <- read.csv("data/learning2014.csv")
# Get the dimensions of the data (166 students, 8 variables) and its structure.
dim(learning2014)
## [1] 166 8
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
# The dataset has been collapsed from a dataset measured by Kimmo Vehkalahti.
# Deep, stra and surf refer to deep, strategic and surface learning values normalized to 1-5.
Do some graphs and tables.
# Show a graphical overview of the data and show summaries of the variables in the data. X is the numeric order of the students.
library(ggplot2)
library(GGally)
# Text table of the dataset
summary(learning2014)
## X gender age attitude deep
## Min. : 1.00 F:110 Min. :17.00 Min. :1.400 Min. :1.583
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Median : 83.50 Median :22.00 Median :3.200 Median :3.667
## Mean : 83.50 Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :166.00 Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
# initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
# define the visualization type (points)
p2 <- p1 + geom_point()
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# add a main title and draw the plot
p4 <- p3 + ggtitle("Student's attitude versus exam points")
# add equation and r-squared to the graph after linear model has been done.
# create a more advanced plot matrix with ggpairs()
pp <- ggpairs(learning2014[-1], mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
pp
shapiro.test(learning2014$deep)
##
## Shapiro-Wilk normality test
##
## data: learning2014$deep
## W = 0.9812, p-value = 0.02371
shapiro.test(learning2014$stra)
##
## Shapiro-Wilk normality test
##
## data: learning2014$stra
## W = 0.99068, p-value = 0.3504
shapiro.test(learning2014$surf)
##
## Shapiro-Wilk normality test
##
## data: learning2014$surf
## W = 0.99219, p-value = 0.5073
shapiro.test(learning2014$points)
##
## Shapiro-Wilk normality test
##
## data: learning2014$points
## W = 0.96898, p-value = 0.0008932
Description of the data: There are more females than males. The mean age of the students is 25.5 years, with a range of 17 to 55. The mean attitude is 3.1. The strategic and surface learning variables look roughly normally distributed, but that is not the case for deep learning or exam points. Normality was assessed with the Shapiro-Wilk test; a p-value below 0.05 suggests that the normality assumption does not hold. Most of the attributes do not correlate much with each other. The largest correlations involve points and the deep learning strategy for both genders. Also, for males (but not females), surface learning had a strong negative correlation with the deep learning strategy.
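The per-variable normality tests above can be run in one pass with sapply; a minimal sketch on synthetic data (with learning2014 one would pass the relevant columns instead):

```r
# Apply the Shapiro-Wilk test to each column of a data frame.
# Synthetic columns here: one exactly normal, one clearly skewed.
vars <- data.frame(
  normal = qnorm(ppoints(50)),  # exact normal quantiles
  skewed = qexp(ppoints(50))    # exponential quantiles, non-normal
)
pvals <- sapply(vars, function(v) shapiro.test(v)$p.value)
round(pvals, 4)  # high p for 'normal', tiny p for 'skewed'
```

With the course data this would be `sapply(learning2014[c("deep", "stra", "surf", "points")], function(v) shapiro.test(v)$p.value)`.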
Explore how well different student attributes predict their exam points.
## Fit an explanatory regression model. Choose three variables, then drop the non-significant ones and refit.
# Do a linear model.
lModel <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(lModel)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
# Surface learning is not even close to significant, and strategic learning is only marginally significant. Drop surf first.
lModel <- lm(points ~ attitude + stra, data = learning2014)
summary(lModel)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
# After dropping surf, strategic learning also has p > 0.05, so drop it too. Only attitude remains.
lModel <- lm(points ~ attitude, data = learning2014)
summary(lModel)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# Add regression formula to the attitude vs points plot.
# Function to get regression formula.
lm_eqn <- function(m) {
  l <- list(a = format(coef(m)[1], digits = 2),
            b = format(abs(coef(m)[2]), digits = 2),
            r2 = format(summary(m)$r.squared, digits = 3))
  if (coef(m)[2] >= 0) {
    eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2, l)
  } else {
    eq <- substitute(italic(y) == a - b %.% italic(x)*","~~italic(r)^2~"="~r2, l)
  }
  as.character(as.expression(eq))
}
p5 <- p4 + annotate("text", x = 2.1, y = 34, label = lm_eqn(lModel), colour = "black", size = 5, parse = TRUE)
p5
Linear regression can be described with the formula: dependent = alpha + beta * explanatory + noise. One can have more than one explanatory variable. In this case only attitude remains as a statistically significant explanatory variable. In the final model alpha is 11.6 and beta is 3.5; that is, each attitude point is worth about 3.5 points in the exam.
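As a quick sanity check, the fitted equation can be applied by hand; the coefficients below are copied from the model summary above, and the attitude score of 4 is a hypothetical example:

```r
# points = alpha + beta * attitude, using the final model's estimates
alpha <- 11.6372
beta  <- 3.5255
attitude <- 4                        # hypothetical student
predicted <- alpha + beta * attitude
predicted                            # about 25.7 exam points
```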
Interpretation.
The multiple R-squared value is 0.19. R-squared tells how well the model fits the data: a value of 1 means the model predicts the data perfectly, whereas 0 means it has no predictive value. A value of 0.19 means that roughly 19 percent of the variation in exam points in the learning2014 dataset is explained by the attitude rating. For a single explanatory variable, it can help to remember that R-squared is simply the correlation coefficient squared. With multiple explanatory variables, the adjusted R-squared value is recommended instead.
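The claim that R-squared is the squared correlation coefficient (for a single explanatory variable) is easy to verify; a sketch using R's built-in mtcars data rather than learning2014:

```r
# Multiple R-squared of a one-predictor model equals cor(x, y)^2.
m  <- lm(mpg ~ wt, data = mtcars)
r2 <- summary(m)$r.squared
all.equal(r2, cor(mtcars$wt, mtcars$mpg)^2)  # TRUE
```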
Validation of model sanity.
## Check if model fits well or not.
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5.
# Draw three panels in a single figure.
par(mfrow = c(1,3))
# Plot Residuals vs Fitted values (1), Normal QQ-plot (2), and Residuals vs Leverage (5).
plot(lModel, which = c(1,2,5))
A linear regression model assumes a linear relationship between the explanatory variables and the target, and normally distributed errors with constant variance. Diagnostic plots help test how well the model meets these assumptions.
The Residuals vs Fitted plot is used to check whether the error variance is constant: if the points are scattered roughly evenly, the assumption holds; if they form a clear shape, it does not. Here the residuals are roughly evenly distributed, so the size of the errors should not depend on the explanatory variable.
The Q-Q plot shows whether the errors are normally distributed: if the points follow the line, normality can be presumed. Here the points stay reasonably close to the line, supporting the assumption of normally distributed errors.
The Residuals vs Leverage plot is used to check for influential outliers. If there are observations with particularly high leverage, one should consider removing them from the analysis. Here the plot suggests there are no such observations.
All in all the model seems reasonable.
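The same checks can be done numerically instead of visually: hatvalues() gives each observation's leverage and cooks.distance() its influence. A sketch on the built-in mtcars data (the 4/n cutoff is just a common rule of thumb):

```r
m    <- lm(mpg ~ wt, data = mtcars)
lev  <- hatvalues(m)       # leverage of each observation
cook <- cooks.distance(m)  # influence of each observation
# Flag observations whose Cook's distance exceeds 4/n.
names(cook[cook > 4 / nrow(mtcars)])
```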
# Load required libraries
library(ggplot2)
library(dplyr)
library(boot)
# Load data from .csv
alc <- read.csv("data/student-alc.csv")
# Column names
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "nursery" "internet" "guardian"
## [16] "traveltime" "studytime" "failures" "schoolsup" "famsup"
## [21] "paid" "activities" "higher" "romantic" "famrel"
## [26] "freetime" "goout" "Dalc" "Walc" "health"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
Description of the data from the authors (truncated): This data approaches student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school-related features, and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades.
Link to the data and a fuller description.
Choose four interesting variables and present hypotheses about their relationship with alcohol consumption. I did not select the DataCamp variables, as their results are already known, except for the final grade, which is an obvious choice.
19 activities - extra-curricular activities (binary: yes or no). I would imagine that having activities predicts low alcohol use. Then again it might not; for example, hockey youths in Finland consume a lot of alcohol. It will be interesting to see if there is any relation to high use; I would guess a marginally significant one.
21 higher - wants to take higher education (binary: yes or no). Those with ambition should drink less if they are serious about it, though merely wanting higher education is easy. Again, it will be interesting to see any relation to high use; I would guess a weak but still significant one.
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I would imagine that poor family relationships predict high alcohol use. Of my selections, I would guess this has the strongest relation to high alcohol use.
32 G3 - final grade (numeric: from 0 to 20, output target). The final grade should have some predictive value, but not a very strong one, at least as far as I remember from the DataCamp exercise. This selection is a bit boring.
## Do some exploratory search with the chosen variables and alcohol use.
# Draw a bar plot of high_use by activities. Gives graphs divided by activities.
g1 <- ggplot(alc, aes(x = high_use))
g1 + geom_bar() + facet_wrap("activities") + ggtitle("Students divided by extra activities (no/yes) to high users (F/T)") + xlab("High alcohol use (False/True)")
19 activities. It looks like students with extra-curricular activities are slightly less likely to be heavy alcohol users, but the difference between active and non-active students is not large.
# Draw a bar plot of high education aspirations and high alcohol use.
g2 <- ggplot(alc, aes(x = high_use))
g2 + geom_bar() + facet_wrap("higher") + ggtitle("Students divided by wanting to take higher education (no/yes) to high users (F/T)") + xlab("High alcohol use (False/True)")
# 21. Education. Very few students do not want to take higher education, so this variable will likely be of little use in later analyses.
alc %>% group_by(higher, high_use) %>% summarise(count = n())
## # A tibble: 4 x 3
## # Groups: higher [?]
## higher high_use count
## <fctr> <lgl> <int>
## 1 no FALSE 9
## 2 no TRUE 9
## 3 yes FALSE 259
## 4 yes TRUE 105
Only 18 students out of 382 are not interested in higher education, which is a bit surprising. High use was 1:1 among those not wanting higher education, so my hypothesis holds, kind of; with such a small n, though, one should not make statistical inferences. Among those wanting higher education the ratio is about 1:2.5 (105 high users to 259 others), much lower than 1:1.
# 24. Family relationship.
# Do a histogram of relationship levels by high alcohol use.
g3 <- ggplot(alc, aes(x = high_use))
g3 + geom_bar() + facet_wrap("famrel") + ggtitle("Students divided by family relationships (1-5) to high users (F/T)") + xlab("High alcohol use (False/True)")
There seem to be proportionally more high alcohol users among those reporting poor family relationships. Fortunately, those reporting very poor relationships are a minority. The hypothesis holds.
# 32. Final grade.
# Do boxplots of high and low alcohol users and their final grades (G3).
g4 <- ggplot(alc, aes(x = high_use, y = G3))
g4 + geom_boxplot() + ggtitle("Students' final grades grouped by alcohol use") + xlab("High alcohol use (False/True)") + ylab("Final grade")
There is a small difference between the high and low users: high users of alcohol have worse grades on average, but there is a large overlap between the groups.
## Doing the logistic regression model.
lrModel <- glm(high_use ~ activities + higher + famrel + G3, data = alc, family = "binomial")
summary(lrModel)
##
## Call:
## glm(formula = high_use ~ activities + higher + famrel + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3793 -0.8390 -0.7433 1.3588 1.9027
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.48228 0.71570 2.071 0.0383 *
## activitiesyes -0.16942 0.22912 -0.739 0.4596
## higheryes -0.53716 0.51831 -1.036 0.3000
## famrel -0.24829 0.12285 -2.021 0.0433 *
## G3 -0.06856 0.03592 -1.909 0.0563 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 453.42 on 377 degrees of freedom
## AIC: 463.42
##
## Number of Fisher Scoring iterations: 4
The model has only two variables that seem worthwhile: family relationships and the final grade. Activities and higher-education aspirations have little chance of being significant predictors of high alcohol use. In the model, good family relationships and a good final grade predict lower alcohol use.
# Compute odds ratios and confidence intervals from the model's coefficients
oddsR <- coef(lrModel) %>% exp
confI <- confint(lrModel) %>% exp
## Waiting for profiling to be done...
# Print odds ratios and confidence intervals. The confidence intervals for activities, higher-education aspirations, and final grades span 1, meaning those variables may have no predictive value. The interval for family relations almost reaches 1 as well.
cbind(oddsR, confI)
## oddsR 2.5 % 97.5 %
## (Intercept) 4.4029534 1.0892712 18.2873241
## activitiesyes 0.8441515 0.5383786 1.3236738
## higheryes 0.5844052 0.2094452 1.6406627
## famrel 0.7801305 0.6125070 0.9932711
## G3 0.9337328 0.8697591 1.0017032
The hypotheses for family relationships and final grades still hold, but rather tenuously once one considers the confidence intervals. Final grades have an odds ratio of about 0.93 (roughly 1:1.07), so even if marginally significant it is of little practical value. I'd say only the family relationships variable has any value.
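The odds ratios above are just exponentiated model coefficients. A minimal sketch of the same computation on the built-in mtcars data (predicting transmission type from weight, purely for illustration):

```r
# Logistic regression; exp(coefficient) gives the odds ratio.
m  <- glm(am ~ wt, data = mtcars, family = "binomial")
or <- exp(coef(m))
or["wt"]  # below 1: heavier cars have lower odds of a manual transmission
```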
Explore the predictive power of the model with the significant variables. I'm including only family relationships, as final grades were only marginally significant and their odds ratio's confidence interval spans 1.
# Refit the model with only family relationships
strippedModel <- glm(high_use ~ famrel, data = alc, family = "binomial")
# Predict high alcohol use with the stripped-down model and, for interest's sake, with the original full model.
predHighStripped <- predict(strippedModel, type = "response")
predHighAll <- predict(lrModel, type = "response")
# Add the probabilities to data structure alc
alc <- mutate(alc, probStripped = predHighStripped)
alc <- mutate(alc, probAll = predHighAll)
# Make predictions of high use
alc <- mutate(alc, predStripped = probStripped > 0.5)
alc <- mutate(alc, predAll = probAll > 0.5)
# Check that code was fine. Look at the first twenty values
select(alc, famrel, high_use, probStripped, predStripped, probAll, predAll) %>% head(20)
## famrel high_use probStripped predStripped probAll predAll
## 1 4 FALSE 0.2927178 FALSE 0.3551263 FALSE
## 2 5 FALSE 0.2410230 FALSE 0.3005091 FALSE
## 3 4 TRUE 0.2927178 FALSE 0.3095389 FALSE
## 4 3 FALSE 0.3503816 FALSE 0.2831074 FALSE
## 5 4 FALSE 0.2927178 FALSE 0.2950794 FALSE
## 6 5 FALSE 0.2410230 FALSE 0.1937715 FALSE
## 7 4 FALSE 0.2927178 FALSE 0.2950794 FALSE
## 8 4 FALSE 0.2927178 FALSE 0.3243810 FALSE
## 9 4 FALSE 0.2927178 FALSE 0.2171708 FALSE
## 10 5 FALSE 0.2410230 FALSE 0.1937715 FALSE
## 11 3 FALSE 0.3503816 FALSE 0.3492025 FALSE
## 12 5 FALSE 0.2410230 FALSE 0.2160970 FALSE
## 13 4 FALSE 0.2927178 FALSE 0.2480893 FALSE
## 14 5 FALSE 0.2410230 FALSE 0.2461718 FALSE
## 15 4 FALSE 0.2927178 FALSE 0.2413852 FALSE
## 16 4 FALSE 0.2927178 FALSE 0.2413852 FALSE
## 17 3 FALSE 0.3503816 FALSE 0.2831074 FALSE
## 18 5 FALSE 0.2410230 FALSE 0.2160970 FALSE
## 19 5 TRUE 0.2410230 FALSE 0.2937649 FALSE
## 20 3 FALSE 0.3503816 FALSE 0.3266436 FALSE
table(high_use = alc$high_use, prediction = alc$predStripped)
## prediction
## high_use FALSE
## FALSE 268
## TRUE 114
# Hmm. The stripped-down model seems to predict no high users at all. Count the number of TRUE values in predStripped to make sure.
sum(alc$predStripped)
## [1] 0
# No TRUE values... Let's see how the full model fares.
table(high_use = alc$high_use, prediction = alc$predAll)
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 112 2
# Better, but not great. The sensitivity of the full model is terrible (2/114), though the specificity is fine (258/268).
# Let's compute the total proportion of inaccurately classified individuals for hoots and giggles, using the full model, as the stripped version would be a bit boring to test.
nIndividuals <- nrow(alc)
nIncorrectPrediction <- sum(alc$high_use != alc$predAll)
nIncorrectPrediction/nIndividuals
## [1] 0.3193717
# The proportion of inaccurate predictions was 32 % (122 out of 382). This figure flatters the model, since it is mostly driven by the model making very few high-usage predictions in the first place.
# Let's compare the model to a simple strategy of if family relationship is 2 or less, then predict high alcohol use.
alc$simpleClassify <- alc$famrel <= 2
table(high_use = alc$high_use, prediction = alc$simpleClassify)
## prediction
## high_use FALSE TRUE
## FALSE 252 16
## TRUE 103 11
nIncorrectPrediction <- sum(alc$high_use != alc$simpleClassify)
nIncorrectPrediction/nIndividuals
## [1] 0.3115183
The proportion of inaccurate predictions was 31 % (119 out of 382). The simple rule beat the full model, but not by much. Of those it flagged as high users, about 41 % (11/27) actually were, compared with 17 % (2/12) for the full model: better, but still not impressive.
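Sensitivity and specificity can be read straight off a 2x2 confusion table; here a sketch using the counts from the full model's table above:

```r
# Rows: true class (FALSE/TRUE); columns: predicted class (FALSE/TRUE).
cm <- matrix(c(258, 112, 10, 2), nrow = 2,
             dimnames = list(truth = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))
sensitivity <- cm["TRUE", "TRUE"]   / sum(cm["TRUE", ])   # 2 / 114
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])  # 258 / 268
round(c(sensitivity = sensitivity, specificity = specificity), 3)
```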
Perform a 10-fold cross-validation of the full model. One needs to define a loss function for the cross-validation code (cv.glm).
# Define loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$predAll)
## [1] 0.3193717
# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = lrModel, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3193717
My full model performs worse than the model in DataCamp: 0.32 vs 0.26, lower values being better.
The Boston dataset has 506 observations and 14 variables, containing information about housing in the suburbs of Boston.
Description of variables in the Boston dataset:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
# Load required libraries
library(ggplot2)
library(dplyr)
library(corrplot)
library(GGally)
library(tidyr)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data('Boston')
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
# print the correlation matrix
cor_matrix<-cor(Boston)
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# Histograms of the variables
Boston %>%
gather(key=var_name, value = value) %>%
ggplot(aes(x=value)) +
geom_histogram() +
facet_wrap(~var_name, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Create color ramp from dark blue to white to red.
colorVector <- c("blue", "white", "red")
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6, col = colorRampPalette(colorVector)(200))
The histograms show that much of the data does not look normally distributed: most variables have high kurtosis or two peaks. The correlation plot shows many strong positive correlations between variables, such as industry share with nitrogen-oxide concentration and with the property-tax rate, and property taxes with access to radial highways. There are also some strong negative correlations, such as between the median value of owner-occupied homes and the share of lower-status population, and between distance to the five Boston employment centres and the proportion of owner-occupied units built before 1940.
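Rather than reading the strongest correlations off the plot, they can be extracted programmatically; a sketch on the built-in mtcars data (the same idea applies to cor_matrix):

```r
cm <- cor(mtcars)
cm[upper.tri(cm, diag = TRUE)] <- NA             # keep each pair once
pairs_df <- na.omit(as.data.frame(as.table(cm))) # long format: var1, var2, cor
names(pairs_df) <- c("var1", "var2", "cor")
# The three pairs with the largest absolute correlation:
head(pairs_df[order(-abs(pairs_df$cor)), ], 3)
```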
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# change the object to data frame from matrix type.
boston_scaled <- as.data.frame(boston_scaled)
sd(boston_scaled$crim)
## [1] 1
# Now all the variables have zero mean and SD of 1. This is recommended for clustering.
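scale() simply subtracts each column's mean and divides by its standard deviation; the equivalence is easy to verify by hand:

```r
x <- mtcars$mpg
manual    <- (x - mean(x)) / sd(x)  # standardize by hand
via_scale <- as.numeric(scale(x))   # same result via scale()
all.equal(manual, via_scale)        # TRUE
```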
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime. The bins have either 127 or 126 elements.
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Select 80 % of the data for training and the remaining 20 % for testing, splitting them into separate datasets.
# number of rows in the Boston dataset
n <- nrow(Boston)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create training set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis. The . means all of the variables.
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object. Note that the first linear discriminant captures 95 % of the trace (between-group variance), and the third only 1.2 %.
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2500000 0.2549505 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 1.0423274 -0.8678430 -0.07742312 -0.8816485 0.4337239
## med_low -0.1147185 -0.2710721 -0.11640431 -0.5698482 -0.1714675
## med_high -0.3690158 0.1875017 0.18636222 0.4265805 0.1367102
## high -0.4872402 1.0171737 -0.03371693 1.0557275 -0.4215955
## age dis rad tax ptratio
## low -0.9007856 0.9021005 -0.6862262 -0.7545979 -0.42718882
## med_low -0.3401523 0.3225994 -0.5543231 -0.4474704 0.02602726
## med_high 0.4183670 -0.3850883 -0.4578134 -0.3458330 -0.36826847
## high 0.8122706 -0.8568075 1.6375616 1.5136504 0.78011702
## black lstat medv
## low 0.3853581 -0.77055325 0.53093183
## med_low 0.3153846 -0.11981937 -0.03424823
## med_high 0.1008101 -0.03002669 0.23257738
## high -0.8595037 0.87240056 -0.68505075
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08851246 0.711820791 -0.959354954
## indus 0.06370071 -0.094789649 -0.006446562
## chas -0.10189359 -0.004689611 0.009694197
## nox 0.34901353 -0.814379698 -1.199469094
## rm -0.14126190 -0.119506833 -0.158758187
## age 0.11109882 -0.309666437 -0.221377230
## dis -0.04784935 -0.188704053 -0.010048427
## rad 3.69530737 0.944819153 -0.472012422
## tax 0.03993662 -0.044292516 0.957906081
## ptratio 0.08951580 -0.035962575 -0.164749976
## black -0.09894491 0.033874431 0.096940572
## lstat 0.22931260 -0.238059776 0.369499471
## medv 0.19885938 -0.446373331 -0.115564735
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9524 0.0354 0.0122
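The proportion-of-trace values printed above can also be recomputed from the singular values stored in the fitted object; a minimal sketch, assuming the `lda.fit` object fitted above:

```r
# proportion of between-group variance explained by each discriminant:
# squared singular values, normalized to sum to one
prop_trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
round(prop_trace, 4)
```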
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results. Both lines need to be run together. The high crime rate class clusters rather well, with only a few medium high observations mixed in and only one high observation near an LD1 value of ~0.
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
# The crime variable was already saved in a vector and removed from the test data in step 5, so neither step needs to be repeated here.
# test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 11 4 0
## med_low 4 17 4 0
## med_high 0 7 13 3
## high 0 0 0 28
The high crime rate class was classified perfectly (all 28 observations correct), while the medium low and medium high classes were only about 60 % correct. The low crime rate properties caused the most confusion: fewer than half were classified correctly, most often being mistaken for medium low. Having the values shift between runs of the script is annoying; setting a seed before the train/test sampling (as is done in the k-means section below) would make the results reproducible.
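The percentages above can be computed directly from the cross-tabulation rather than eyeballed; a short sketch, assuming the `correct_classes` and `lda.pred` objects from above:

```r
# cross-tabulation of correct vs. predicted classes, as above
cm <- table(correct = correct_classes, predicted = lda.pred$class)
# overall accuracy: share of observations on the diagonal
sum(diag(cm)) / sum(cm)
# per-class accuracy: diagonal counts divided by row totals
diag(cm) / rowSums(cm)
```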
# Reload Boston dataset.
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame from matrix type.
boston_scaled <- as.data.frame(boston_scaled)
# Calculate the Euclidean distances between observations.
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# K-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
# Determining how many clusters to use via the within-cluster sum of squares (WCSS). Set a seed so that repeated runs give the same result.
set.seed(123)
# determine the number of clusters using the total within sum of squares (twcss)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results. The total WCSS drops sharply until two or three clusters, so either seems reasonable. For visualization purposes two is handier, so let's go with that.
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# plot the normalized Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
# I tried to use ggpairs but didn't manage to color it by cluster. One does get the correlation values and distributions with ggpairs, though.
ggpairs(boston_scaled)
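For reference, `ggpairs` can color points by cluster if the cluster assignment is added to the data frame and passed through the `mapping` aesthetic; a sketch under that assumption (the `boston_clustered` name and the column subset, chosen to keep the plot readable, are mine):

```r
library(GGally)
library(ggplot2)
# add the k-means cluster assignment as a factor column
boston_clustered <- data.frame(boston_scaled, cluster = factor(km$cluster))
# color every panel by cluster; plot only the first few columns for readability
ggpairs(boston_clustered, mapping = aes(colour = cluster), columns = 1:5)
```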
Per capita crime rate by town (crim) is most strongly correlated with accessibility to radial highways (rad, 0.63), full-value property-tax rate (tax, 0.58), and lower status of the population (lstat, 0.46). rad and tax are very highly correlated with each other (0.91), and both are fairly strongly correlated with lstat (~0.45). Since rad and tax are so highly correlated, it is hard to disentangle what encourages crime in an area: good connections, or wealth as measured by taxes. I would have imagined that lower status would be negatively correlated with tax, so the tax value is likely not a good indicator of an area's wealth. I find it a bit hard to guess what these correlations mean. One interesting value is that a higher proportion of blacks by town (black) has a weak negative correlation with the per capita crime rate.
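The correlations quoted above can be pulled out numerically instead of being read off the plot; a minimal sketch on the scaled data frame created above:

```r
# correlations of per capita crime rate with all other variables,
# ordered from strongest positive to strongest negative
crim_cors <- cor(boston_scaled)[, "crim"]
sort(crim_cors[names(crim_cors) != "crim"], decreasing = TRUE)
```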
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
# Plot with crime classes as color of the training data.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# Plot with clusters of k-means as color of the training data.
# First, run k-means with 4 clusters so that the number of clusters matches the four crime classes.
km3D <- kmeans(boston_scaled, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km3D$cluster[ind])
Comparing the methods by eyeballing: the high crime rate class and the third k-means cluster are rather similar. Visually the k-means clustering seems to work better, with less intermingling; the crime classes mix more, even to the point where the low and medium high classes are interspersed. Cluster 1 and the low class match somewhat, but clusters 2 and 4 do not match the medium high and medium low classes well. At least on one run of the file; on another run, maybe not, since both the sampling and the clustering are random. Note that you might need to click and drag the figures to see anything.
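The eyeball comparison can be backed up with a cross-tabulation of the training-set crime classes against the 4-cluster k-means assignment (subset to the training rows with `ind`, as in the plot above):

```r
# rows: true crime classes; columns: k-means clusters of the same observations
table(crime = train$crime, cluster = km3D$cluster[ind])
```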